Data Analysis on REDD+

Introduction

The Paris Climate Agreement recognizes forests as a key part of the solution to the climate change challenge. Better land stewardship may provide 37% of the cost-effective climate change mitigation needed by 2030 to keep global warming below 2°C, with reforestation, avoided deforestation and natural forest management contributing nearly two-thirds of this potential [1]. REDD+ (reducing emissions from deforestation and forest degradation, and enhancing carbon stocks) was introduced at the UNFCCC Conference of the Parties in Bali in 2007, and as of July 2022 more than 400 projects across countries are registered with 'Ongoing' status. In this project exercise, I focus on summarizing the REDD+ projects, developing some hypotheses and testing their statistical significance, and reporting the results in a Tableau dashboard.

Purpose

The dataset was downloaded from the International Database on REDD+ projects and programs (IDRECCO), with updates as of 22 July 2022. The purpose of this exercise is to:

  • explore REDD+ projects
  • clean/prepare the dataset
  • develop hypotheses on the areas most critical for reaching a conclusion
  • report the result in Tableau.

The dataset is structured as an Excel workbook with 10 sheets, each containing information on a specific area such as Project, Carbon Certification, Financing Source, Community Level Intervention and Host Country. So let's start.

1.0 Dataset preparation

Part One. Load the dataset and import libraries

In [1]:
import pandas as pd
import csv
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
import copy
sns.set() #setting the default seaborn style for our plots
In [2]:
data_original = pd.read_excel(r'C:\Users\Nomuun\Desktop\Env Econ\REDD.xlsx', sheet_name = "1. Project")

Part Two. Explore the Dataset

In [3]:
data_original.columns = [x.lower() for x in data_original.columns]
data_original.head(3)
Out[3]:
project id project name secondary name last idrecco update (yyyymmdd) size (in hectare) size of crediting area (in hectare) start year end year duration project description ... data quality - carbon transacations data quality - financing sources data quality - community interventions status longitude (decimal degrees) latitude (decimal degrees) multiple locations? region jurisdiction level 1 jurisdiction level 2
0 100 PRC Commercial reforestation on lands dedicated to... 2020-09-04 3137.0 9999.0 2000 2030 30 The proposed A/R CDM project activity consists... ... good data good data good data Ongoing -74.647083 9.930694 No South America Department : Magdalena Municipality : El Banco
1 101 Sierra Gorda Premium Carbon: Carbon Sequestrat... Carbon Sequestration in Communities of Extreme... 2020-09-04 247.0 247.0 1997 2042 46 Bosque Sustentable A.C. is working with privat... ... good data good data good data Ongoing 9999.999999 9999.999999 Non South America State : Querétaro and State : San Luis Potosí Municipality : Pinal de Amoles, Jalpan de Serr...
2 102 Scolel 'te Scolel té Natural Resources Management and Car... 2020-09-04 9049.0 7662.0 1997 2027 30 Scolel Té is a project that assists farmers an... ... good data good data good data Ongoing -90.680747 16.336757 Yes South America State : Chiapas and State : Oaxaca Municipality : Tuxtla Gutiérrez

3 rows × 41 columns

In [4]:
data_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 41 columns):
 #   Column                                                           Non-Null Count  Dtype         
---  ------                                                           --------------  -----         
 0   project id                                                       624 non-null    int64         
 1   project name                                                     624 non-null    object        
 2   secondary name                                                   261 non-null    object        
 3   last idrecco update (yyyymmdd)                                   624 non-null    datetime64[ns]
 4   size (in hectare)                                                624 non-null    float64       
 5   size of crediting area (in hectare)                              624 non-null    float64       
 6   start year                                                       624 non-null    int64         
 7   end year                                                         624 non-null    int64         
 8   duration                                                         624 non-null    int64         
 9   project description                                              620 non-null    object        
 10  objective 1                                                      624 non-null    object        
 11  objective 2                                                      624 non-null    object        
 12  objective 3                                                      623 non-null    object        
 13  deforestation drivers                                            624 non-null    object        
 14  type of forest                                                   624 non-null    object        
 15  project type                                                     624 non-null    object        
 16  details for afforestation/reforestation activity                 324 non-null    object        
 17  project located in an iucn protected area?                       624 non-null    object        
 18  dominant type                                                    624 non-null    object        
 19  was fpic used?                                                   624 non-null    object        
 20  was participatory approach used?                                 624 non-null    object        
 21  id_country                                                       624 non-null    int64         
 22  location description                                             595 non-null    object        
 23  project partners                                                 448 non-null    object        
 24  information sources                                              623 non-null    object        
 25  name of the protected area                                       505 non-null    object        
 26  size of the protected area (hectares)                            505 non-null    float64       
 27  estimated proportion of the project located in a protected area  505 non-null    object        
 28  category of protected area (iucn classification)                 504 non-null    object        
 29  type of community participation                                  624 non-null    object        
 30  data quality - carbon certification                              624 non-null    object        
 31  data quality - carbon transacations                              624 non-null    object        
 32  data quality - financing sources                                 624 non-null    object        
 33  data quality - community interventions                           624 non-null    object        
 34  status                                                           624 non-null    object        
 35  longitude (decimal degrees)                                      624 non-null    float64       
 36  latitude (decimal degrees)                                       624 non-null    float64       
 37  multiple locations?                                              624 non-null    object        
 38  region                                                           624 non-null    object        
 39  jurisdiction level 1                                             604 non-null    object        
 40  jurisdiction level 2                                             571 non-null    object        
dtypes: datetime64[ns](1), float64(5), int64(5), object(30)
memory usage: 200.0+ KB
In [5]:
data_original.isnull().mean()
Out[5]:
project id                                                         0.000000
project name                                                       0.000000
secondary name                                                     0.581731
last idrecco update (yyyymmdd)                                     0.000000
size (in hectare)                                                  0.000000
size of crediting area (in hectare)                                0.000000
start year                                                         0.000000
end year                                                           0.000000
duration                                                           0.000000
project description                                                0.006410
objective 1                                                        0.000000
objective 2                                                        0.000000
objective 3                                                        0.001603
deforestation drivers                                              0.000000
type of forest                                                     0.000000
project type                                                       0.000000
details for afforestation/reforestation activity                   0.480769
project located in an iucn protected area?                         0.000000
dominant type                                                      0.000000
was fpic used?                                                     0.000000
was participatory approach used?                                   0.000000
id_country                                                         0.000000
location description                                               0.046474
project partners                                                   0.282051
information sources                                                0.001603
name of the protected area                                         0.190705
size of the protected area (hectares)                              0.190705
estimated proportion of the project located in a protected area    0.190705
category of protected area (iucn classification)                   0.192308
type of community participation                                    0.000000
data quality - carbon certification                                0.000000
data quality - carbon transacations                                0.000000
data quality - financing sources                                   0.000000
data quality - community interventions                             0.000000
status                                                             0.000000
longitude (decimal degrees)                                        0.000000
latitude (decimal degrees)                                         0.000000
multiple locations?                                                0.000000
region                                                             0.000000
jurisdiction level 1                                               0.032051
jurisdiction level 2                                               0.084936
dtype: float64
In [6]:
sns.heatmap(data_original.isnull(), cbar=False)
Out[6]:
<AxesSubplot:>

As the heatmap shows, the following attributes have missing data:

  • High - Secondary name is missing ~58% of its values, Details for Afforestation/Reforestation activity ~48%, and Project partners ~28%
  • Medium - Name of the protected area, Size of the protected area (hectares), Estimated proportion of the project located in a protected area and Category of protected area (IUCN classification) are each missing ~19%; Jurisdiction level 2 is missing ~8%
  • Low - Less than ~5% missing for Project description, Objective 3, Information sources, Location description and Jurisdiction level 1
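
The grouping above can also be derived programmatically rather than read off the heatmap. Below is a minimal sketch, assuming illustrative cut-offs of 5% and 25% (my own choice, not taken from the dataset documentation) and a toy frame standing in for data_original; the helper name missing_tiers is hypothetical.

```python
import pandas as pd

def missing_tiers(df, low=0.05, high=0.25):
    """Share of missing values per column, labelled Low / Medium / High.

    The cut-offs `low` and `high` are illustrative assumptions.
    """
    share = df.isnull().mean()
    share = share[share > 0].sort_values(ascending=False)  # keep incomplete columns only
    tier = pd.cut(share, bins=[0, low, high, 1], labels=["Low", "Medium", "High"])
    return pd.DataFrame({"missing_share": share, "tier": tier})

# Toy data (40 rows) standing in for data_original
toy = pd.DataFrame({
    "project id": list(range(40)),                      # complete
    "secondary name": [None] * 24 + ["x"] * 16,         # 60% missing
    "jurisdiction level 2": [None] * 4 + ["y"] * 36,    # 10% missing
    "project description": [None] * 1 + ["z"] * 39,     # 2.5% missing
})

result = missing_tiers(toy)
print(result)
```

On the real data the call would be missing_tiers(data_original); the thresholds can be tuned to match whatever tiers are most useful for the report.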

As for data types, it seems all attributes were correctly recognized. The column names don't follow Python's preferred snake case, which would cause some issues during the analysis, so I rename the columns after removing the unused columns from the dataset.
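
As an alternative to a hand-written rename dictionary, the snake-case conversion can be sketched with a small helper. This is only an illustration: to_snake_case is a hypothetical name, and it produces full-length names, so results differ from manual abbreviations (for example, it would not shorten "protected area" to "pa").

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lower-case a label and collapse each run of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

# Toy frame with headers in the style of the Project sheet
toy = pd.DataFrame(columns=[
    "Project ID",
    "Size (in hectare)",
    "Data quality - carbon certification",
    "Was FPIC used?",
])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
```

Because the helper strips every non-alphanumeric character, names such as "population (2019)" would also become valid Python identifiers.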

In [7]:
data = data_original.drop(['last idrecco update (yyyymmdd)' 
                           ,'longitude (decimal degrees)' 
                           ,'latitude (decimal degrees)'
                           ,'jurisdiction level 1'
                           ,'jurisdiction level 2'
                           ,'type of community participation'
                           ,'multiple locations?'
                           ,'category of protected area (iucn classification)'
                           ,'data quality - financing sources'
                           ,'name of the protected area'
                           ,'information sources'
                           ,'location description'
                           ,'project partners'
                           ,'type of forest'
                           ,'was fpic used?'
                           ,'dominant type'
                           ,'data quality - carbon certification'
                           ,'data quality - carbon transacations'
                           ,'data quality - community interventions'
                           ,'project located in an iucn protected area?'
                           ,'was participatory approach used?'
                           ,'details for afforestation/reforestation activity'
                           ]
                          ,axis = 1)
In [8]:
data = data.rename(columns = 
                           {'project id':'project_id'
                            , 'project name': 'project_name'
                            , 'secondary name': 'secondary_name'
                            , 'size (in hectare)': 'size_in_hectare'
                            , 'size of crediting area (in hectare)': 'size_of_crediting_area_in_hectare'
                            , 'start year': 'start_year'
                            , 'end year': 'end_year'
                            , 'project description' : 'project_description'
                            , 'objective 1' : 'objective_1'
                            , 'objective 2' : 'objective_2'
                            , 'objective 3' : 'objective_3'
                            , 'deforestation drivers' : 'deforestation_drivers'
                            , 'project type' : 'project_type'
                            , 'size of the protected area (hectares)' : 'size_of_pa_hectares'
                            , 'estimated proportion of the project located in a protected area': 'estimated_proportion_of_project_located_in_pa'
                           })
In [9]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 19 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   project_id                                     624 non-null    int64  
 1   project_name                                   624 non-null    object 
 2   secondary_name                                 261 non-null    object 
 3   size_in_hectare                                624 non-null    float64
 4   size_of_crediting_area_in_hectare              624 non-null    float64
 5   start_year                                     624 non-null    int64  
 6   end_year                                       624 non-null    int64  
 7   duration                                       624 non-null    int64  
 8   project_description                            620 non-null    object 
 9   objective_1                                    624 non-null    object 
 10  objective_2                                    624 non-null    object 
 11  objective_3                                    623 non-null    object 
 12  deforestation_drivers                          624 non-null    object 
 13  project_type                                   624 non-null    object 
 14  id_country                                     624 non-null    int64  
 15  size_of_pa_hectares                            505 non-null    float64
 16  estimated_proportion_of_project_located_in_pa  505 non-null    object 
 17  status                                         624 non-null    object 
 18  region                                         624 non-null    object 
dtypes: float64(3), int64(5), object(11)
memory usage: 92.8+ KB

2.0 Projects by Country

In [10]:
df_c = pd.read_excel(r'C:\Users\Nomuun\Desktop\Env Econ\REDD.xlsx', sheet_name = "9. Country")
df_c.columns = [x.lower() for x in df_c.columns]
df_c.columns
Out[10]:
Index(['id_country', 'country name', 'human development index (2019)',
       'gdp (billion usd, 2019)', 'gdp per capita (usd/person, 2019)',
       'population (2019)', 'forest area (2020, ha)', 'forest loss (2020)',
       'annual deforestation rate 2015-2020 (%)',
       'index of government effectiveness (2018)',
       'index of corruption control (2018)',
       'participation in global redd+ programs', 'forest tenure (2015)',
       'comment'],
      dtype='object')
In [11]:
df_c = df_c.rename(columns = 
                           {'country name':'country_name'
                            , 'human development index (2019)': 'hdi_2019'
                            , 'gdp (billion usd, 2019)': 'gdp_bln_usd_2019'
                            , 'gdp per capita (usd/person, 2019)': 'gdp_per_capita_usd_2019'
                            , 'forest area (2020, ha)': 'forest_area_2020_ha'
                            , 'forest loss (2020)' : 'forest_loss_2020'
                            , 'annual deforestation rate 2015-2020 (%)': 'annual_deforestation_rate_2015_2020'
                            , 'index of government effectiveness (2018)' : 'index_of_government_effectiveness_2018'
                            , 'index of corruption control (2018)' : 'index_of_corruption_control_2018'
                            , 'participation in global redd+ programs' : 'participation_in_global_redd+_programs'
                            , 'forest tenure (2015)' : 'forest_tenure_2015'
                            , 'deforestation drivers' : 'deforestation_drivers'
                            , 'project type' : 'project_type'
                            
                           })
In [12]:
merged_c = pd.merge(left = data, right = df_c, on = 'id_country')
merged_c.head(3)
Out[12]:
project_id project_name secondary_name size_in_hectare size_of_crediting_area_in_hectare start_year end_year duration project_description objective_1 ... gdp_per_capita_usd_2019 population (2019) forest_area_2020_ha forest_loss_2020 annual_deforestation_rate_2015_2020 index_of_government_effectiveness_2018 index_of_corruption_control_2018 participation_in_global_redd+_programs forest_tenure_2015 comment
0 100 PRC Commercial reforestation on lands dedicated to... 3137.0 9999.0 2000 2030 30 The proposed A/R CDM project activity consists... biodiversity conservation ... 6432.4 50.33944 59141.91 -198.55 -0.332378 -0.085226 -0.301493 FCPF|UNREDD|BioCarbon Fund Initiative for Sust... 65.96% public; 30.42% private; 3.61% unknown. NaN
1 133 Pachamama NaN 1980.0 1980.0 2009 2034 25 Pachamama Forest is developing four different ... biodiversity conservation ... 6432.4 50.33944 59141.91 -198.55 -0.332378 -0.085226 -0.301493 FCPF|UNREDD|BioCarbon Fund Initiative for Sust... 65.96% public; 30.42% private; 3.61% unknown. NaN
2 134 San Nicolas Carbon Sequestration Project San Nicolas CDM Reforestation Project 1101.0 1101.0 2008 2027 20 The project development objective is to pionee... development;social development ... 6432.4 50.33944 59141.91 -198.55 -0.332378 -0.085226 -0.301493 FCPF|UNREDD|BioCarbon Fund Initiative for Sust... 65.96% public; 30.42% private; 3.61% unknown. NaN

3 rows × 32 columns
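
pd.merge defaults to an inner join, so a project whose id_country is missing from the country sheet would be dropped without warning. Here all 624 rows survive, but the check can be made explicit. A minimal sketch on toy frames (the ids and names below are made up for illustration), using the validate and indicator parameters of pd.merge:

```python
import pandas as pd

# Toy stand-ins for the project table and the country sheet
projects = pd.DataFrame({"project_id": [100, 133, 200], "id_country": [1, 1, 3]})
countries = pd.DataFrame({"id_country": [1, 2], "country_name": ["Country A", "Country B"]})

# how="left" keeps every project; validate raises if id_country is not unique
# on the country side; indicator adds a "_merge" column showing the match status
checked = pd.merge(projects, countries, on="id_country", how="left",
                   validate="many_to_one", indicator=True)

unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["project_id", "id_country"]])
```

An empty unmatched frame means the inner join and the left join give the same rows.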

In [13]:
merged_c.describe().T
Out[13]:
count mean std min 25% 50% 75% max
project_id 624.0 4.553237e+02 2.225397e+02 100.000000 266.750000 430.500000 622.250000 8.840000e+02
size_in_hectare 624.0 1.730173e+06 1.329796e+07 4.000000 2000.000000 10000.000000 87835.000000 2.377650e+08
size_of_crediting_area_in_hectare 624.0 2.520304e+05 1.975679e+06 5.000000 5345.750000 9999.000000 9999.000000 3.165530e+07
start_year 624.0 2.752077e+03 2.321714e+03 1979.000000 2008.000000 2011.000000 2014.000000 9.999000e+03
end_year 624.0 3.787580e+03 3.297166e+03 2006.000000 2032.000000 2042.000000 2075.500000 9.999000e+03
duration 624.0 2.155784e+03 4.085369e+03 1.000000 25.000000 30.000000 60.000000 9.999000e+03
id_country 624.0 3.585321e+02 2.386287e+02 24.000000 156.000000 356.000000 550.500000 8.940000e+02
size_of_pa_hectares 505.0 8.039826e+04 3.806587e+05 5.000000 9999.000000 9999.000000 9999.000000 4.268100e+06
hdi_2019 624.0 6.750801e-01 1.080469e-01 0.377000 0.579000 0.707000 0.761000 8.470000e-01
gdp_bln_usd_2019 624.0 1.682586e+03 3.740241e+03 0.283990 47.319620 272.119700 1258.286720 1.434290e+04
gdp_per_capita_usd_2019 624.0 5.329442e+03 3.839604e+03 411.600000 1816.500000 4620.000000 8717.200000 1.727650e+04
population (2019) 624.0 2.471158e+02 4.325955e+02 0.299880 28.585490 52.573970 211.049530 1.397715e+03
forest_area_2020_ha 624.0 1.117752e+05 1.554321e+05 44.980000 12429.810000 59141.910000 92133.200000 4.966196e+05
forest_loss_2020 624.0 -1.668631e+02 7.812104e+02 -1453.040000 -469.000000 -127.766000 0.000000 1.936786e+03
annual_deforestation_rate_2015_2020 624.0 -2.605186e-01 7.045008e-01 -3.564301 -0.616794 -0.290045 0.000000 1.130403e+00
index_of_government_effectiveness_2018 624.0 -2.655185e-01 5.072811e-01 -1.581517 -0.578343 -0.245431 0.179875 1.084034e+00
index_of_corruption_control_2018 624.0 -4.824229e-01 4.479420e-01 -1.503398 -0.822813 -0.419696 -0.271244 1.265534e+00
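
In the summary above, start_year, end_year and duration all have a maximum of 9999, and several quartiles of the size columns are exactly 9999. This suggests that 9999 may be a placeholder for unknown values; the source does not state this, so treat it as an assumption. Under that assumption, a minimal sketch of masking the placeholder before computing statistics (mask_sentinel is a hypothetical helper, shown on toy data):

```python
import numpy as np
import pandas as pd

SENTINEL = 9999  # assumed "unknown" code; not confirmed by the dataset documentation

def mask_sentinel(df, columns, sentinel=SENTINEL):
    """Return a copy of df where `sentinel` in the given columns becomes NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(sentinel, np.nan)
    return out

# Toy data standing in for merged_c
toy = pd.DataFrame({
    "start_year": [2000, 2011, 9999],
    "end_year": [2030, 9999, 9999],
    "duration": [30, 9999, 9999],
})

clean = mask_sentinel(toy, ["start_year", "end_year", "duration"])
print(clean.describe().T[["count", "mean", "max"]])
```

With the placeholder masked, the mean and standard deviation of the year columns are no longer inflated, and count shows how many values are actually known.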
In [14]:
merged_c['count'] = 1
merged_c.groupby(['status'])['count'].sum()
Out[14]:
status
Abandoned                        22
Cannot be confirmed              65
Ended                           108
Ongoing                         416
Planned                           2
Temporarily paused                1
Terminated ahead of schedule     10
Name: count, dtype: int64
In [15]:
merged_c.groupby(['status','region']).status.count()
Out[15]:
status                        region       
Abandoned                     Africa             5
                              Asia               9
                              Oceania            1
                              South America      7
Cannot be confirmed           Africa            16
                              Asia              14
                              South America     35
Ended                         Africa            34
                              Asia              38
                              Oceania            3
                              South America     33
Ongoing                       Africa           107
                              Asia             112
                              Oceania            5
                              South America    192
Planned                       Asia               1
                              South America      1
Temporarily paused            Africa             1
Terminated ahead of schedule  Africa             6
                              Asia               2
                              South America      2
Name: status, dtype: int64
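
The same counts can be obtained without a helper column. A minimal sketch on toy data, using value_counts for the one-way table and pd.crosstab for status by region:

```python
import pandas as pd

# Toy data standing in for merged_c
toy = pd.DataFrame({
    "status": ["Ongoing", "Ongoing", "Ended", "Ongoing", "Abandoned"],
    "region": ["Asia", "Africa", "Asia", "Asia", "Africa"],
})

status_counts = toy["status"].value_counts()
# margins=True adds an "All" row and column holding the totals
status_by_region = pd.crosstab(toy["status"], toy["region"], margins=True)

print(status_counts)
print(status_by_region)
```

The crosstab layout puts regions in columns, which can be easier to scan than the stacked index of a two-key groupby.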
In [16]:
df_ongoing = merged_c[merged_c.status == 'Ongoing'].groupby(['region','country_name','status']).agg({'count':'sum'}).sort_values(by = 'count', ascending = False).reset_index()

2.1 Exploratory Data Analysis

  1. Overview of project implementations and dataset outliers
  2. Which countries are losing their forest at the highest rate, and which countries are implementing more reforestation? How about forest sizes?

2.1 Part One

In [17]:
plt.figure(figsize = (8, 10))
plt.subplot(6,1,1)
sns.boxplot(x=merged_c.hdi_2019, color = 'lightblue')
plt.subplot(6,1,2)
sns.boxplot(x=merged_c.gdp_per_capita_usd_2019, color = 'red')
plt.subplot(6,1,3)
sns.boxplot(x=merged_c.size_in_hectare, color = 'lightblue')
plt.subplot(6,1,4)
sns.boxplot(x=merged_c.forest_loss_2020, color = 'blue')
plt.subplot(6,1,5)
sns.boxplot(x=merged_c.annual_deforestation_rate_2015_2020, color = 'lightblue')

plt.tight_layout()
  • The median Human Development Index is .707 (1st and 3rd quartiles of .579 and .761). For GDP per capita, the median is 4,620 USD (1st and 3rd quartiles of 1,816 USD and 8,717 USD).
  • Notably, project forest size varies widely, with a large standard deviation, while the majority of projects are concentrated around 10,000 hectares.
  • For forest_loss_2020 and annual_deforestation_rate_2015_2020 there are a few outliers to be explored further.

Note: Since the dataset is project-focused, country-level information such as country name, Human Development Index or forest loss is the same across all projects from one country.
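
Because country-level values repeat once per project, countries with many projects weigh more heavily in project-level box plots. One way to look at outliers per country is to keep one row per country first and then apply the conventional 1.5 × IQR rule. A minimal sketch on made-up numbers (iqr_outliers is a hypothetical helper):

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# Toy data: countries A and E appear twice because they host two projects each
toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "C", "D", "E", "E"],
    "forest_loss_2020": [-10.0, -10.0, -12.0, -8.0, -11.0, -900.0, -900.0],
})

# One row per country, so each country counts once
by_country = toy.drop_duplicates("country_name").set_index("country_name")
outliers = iqr_outliers(by_country["forest_loss_2020"])
print(outliers)
```

On the real data the same two steps would run on merged_c, using whichever country-level column is of interest.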

In [18]:
fig = px.bar(df_ongoing, x = 'country_name', y = 'count', color = 'region'
       , template="simple_white")
fig.update_layout(title_text = 'Number of REDD+ projects by Country')
fig.show()

2.1 Part Two

In [19]:
#extracting the top values for each country
def top_bycountries(df, col_n, top_n = 5, rev = True):
    table = {}
    for i, row in df.iterrows():
        country = row.iloc[19]   # position 19 = country_name
        value = row.iloc[col_n]
        # keep the largest value seen for each country
        # ('not in' also handles stored values of 0, which are falsy)
        if country not in table or table[country] < value:
            table[country] = value
    # rev = True lists the largest values first, rev = False the smallest
    sorted_table = sorted(table.items(), key = lambda x: x[1], reverse = rev)

    for entry in sorted_table[:top_n]:
        print(entry[0], ": ", entry[1])
In [20]:
# extracting the top values across all projects
def top_(df, col_i, top_n = 5, rev = True):
    table = []
    for i, row in df.iterrows():
        country = row.iloc[19]   # position 19 = country_name
        project = row.iloc[1]    # position 1 = project_name
        hectar = row.iloc[col_i]
        table.append((hectar, country, project))

    # tuples sort by their first element, i.e. the value in column col_i
    # (ties fall back to country, then project name)
    table_sorted = sorted(table, reverse = rev)

    for value, country, project in table_sorted[:top_n]:
        print(f"{country} : {value} - {project}")
In [21]:
#Need to look up each column header's index, so it can be used for the functions created
name = merged_c.columns.to_list()
for i, row in enumerate(name):
    print(i, row)
0 project_id
1 project_name
2 secondary_name
3 size_in_hectare
4 size_of_crediting_area_in_hectare
5 start_year
6 end_year
7 duration
8 project_description
9 objective_1
10 objective_2
11 objective_3
12 deforestation_drivers
13 project_type
14 id_country
15 size_of_pa_hectares
16 estimated_proportion_of_project_located_in_pa
17 status
18 region
19 country_name
20 hdi_2019
21 gdp_bln_usd_2019
22 gdp_per_capita_usd_2019
23 population (2019)
24 forest_area_2020_ha
25 forest_loss_2020
26 annual_deforestation_rate_2015_2020
27 index_of_government_effectiveness_2018
28 index_of_corruption_control_2018
29 participation_in_global_redd+_programs
30 forest_tenure_2015
31 comment
32 count
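
Looking up positions by hand is fragile, since any added or dropped column shifts the indices. A label-based alternative can be sketched with groupby and nlargest / nsmallest. This is not the approach used in the cells here, only an illustration on toy data; top_by_country is a hypothetical name.

```python
import pandas as pd

def top_by_country(df, column, top_n=5, largest=True):
    """Largest value of `column` per country, then the top_n countries."""
    per_country = df.groupby("country_name")[column].max()
    return per_country.nlargest(top_n) if largest else per_country.nsmallest(top_n)

# Toy data standing in for merged_c
toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "C"],
    "size_in_hectare": [100.0, 250.0, 40.0, 10.0],
    "forest_loss_2020": [-50.0, -50.0, -5.0, 20.0],
})

biggest = top_by_country(toy, "size_in_hectare", top_n=2)
most_loss = top_by_country(toy, "forest_loss_2020", top_n=2, largest=False)
print(biggest)
print(most_loss)
```

The function returns a Series instead of printing, so the result can be reused, for example in a chart or an export for Tableau.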

2.1 Part Three

Next, I look up the top 5 values of size_in_hectare, forest_loss_2020 and annual_deforestation_rate_2015_2020, as these attributes are closely related to the forest condition of a project or country.

In [22]:
#Top 5 projects by size_in_hectare
top_(merged_c, 3, 5)
Brazil : 237765000.0 - Jurisdictional program of the State of Rondônia in Brazil
Brazil : 155915000.0 - Jurisdictional program of the State of Amazonas in Brazil
Brazil : 124796000.0 - Jurisdictional program of the State of Pará in Brazil
Brazil : 90337800.0 - Jurisdictional Program of the State of Mato Grosso in Brazil
Peru : 36877300.0 - Jurisdictional program of the Region of Loreto in Peru
In [23]:
#Top 5 countries by size_in_hectare
top_bycountries(merged_c, 3, 5)
Brazil :  237765000.0
Peru :  36877300.0
Indonesia :  31655300.0
Congo, the Democratic Republic of the :  20055900.0
Mozambique :  10500800.0
In [24]:
#Top 5 countries by forest_loss_2020
top_bycountries(merged_c, 25, rev = False)
#since forest_loss is country-level information that is duplicated for each project,
#I decided to look at it only by country
Brazil :  -1453.04
Congo, the Democratic Republic of the :  -1101.376
Indonesia :  -578.939999999999
Angola :  -555.062
Tanzania, United Republic of :  -469.0
In [25]:
#Top 5 countries reversing forest loss in 2020 (largest forest_loss_2020 values)
top_bycountries(merged_c, 25)
China :  1936.786
India :  266.4
Chile :  122.924
Viet Nam :  116.246
Philippines :  34.8880000000001
In [26]:
#Top 10 countries by the lowest annual_deforestation_rate_2015_2020
top_bycountries(merged_c, 26,10, rev = False)
Côte d'Ivoire :  -3.56430144450434
Nicaragua :  -2.70120239071242
Cambodia :  -1.82526832300998
Malawi :  -1.77500102488705
Uganda :  -1.67673262256369
Paraguay :  -1.64985182757532
Egypt :  -1.463091361
Niger :  -1.11222334352137
Tanzania, United Republic of :  -0.994853448065125
Myanmar :  -0.985164093818602
In [27]:
#Top 10 countries by the highest annual_deforestation_rate_2015_2020
top_bycountries(merged_c, 26, 10)
Uruguay :  1.13040324512219
China :  0.904478289907962
Viet Nam :  0.81333744346328
Chile :  0.689026609441834
Fiji :  0.596481536852722
Costa Rica :  0.548233894235706
Kenya :  0.498523542120721
Philippines :  0.492519097736999
Rwanda :  0.44054569626728
India :  0.373324586928514

Summary:

  • Projects mostly (1st to 3rd quartile) come from countries with a 'medium to high' Human Development Index and a GDP per capita of 1,816 USD - 8,717 USD.
  • Project forest size varies widely, with a large standard deviation, while the majority of projects are concentrated around 10,000 hectares, or 100 sq. km.
  • Forest loss in 2020 had a few outliers: Brazil's forest was decreasing at a level unmatched by other countries, while China was the outlier on the other side in terms of reforestation.
  • For the five-year deforestation rate (2015-2020), there were a few negative outliers, the two lowest being Côte d'Ivoire and Nicaragua. There was only one positive outlier, Uruguay.

2.2 Confirmatory Data Analysis

Based on what I learned in the sections above, I developed the following hypotheses to test. There are of course more questions that could be examined, but for the sake of the report's length the number of questions is limited. Hypotheses:

  1. Are countries with lower GDP per capita more likely to implement REDD+ projects?
  2. Is there any correlation between the size of a country's forest area and the number of REDD+ projects implemented? How about forest loss?
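
One possible way to put a number on the first hypothesis is a rank correlation between GDP per capita and the number of projects, computed on a table with one row per country. The sketch below uses scipy.stats.spearmanr; choosing Spearman over Pearson is my assumption (project counts are heavily skewed), and the values are made up to show the mechanics only.

```python
import pandas as pd
from scipy import stats

# Toy country-level table: one row per country (values are invented)
toy = pd.DataFrame({
    "gdp_per_capita_usd_2019": [500, 1200, 3000, 6000, 9000, 15000],
    "count": [20, 14, 9, 7, 3, 1],
})

# Spearman's rho measures monotonic association; the p-value tests rho = 0
rho, p_value = stats.spearmanr(toy["gdp_per_capita_usd_2019"], toy["count"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```

A negative rho with a small p-value would be consistent with lower-income countries hosting more projects, although it would not by itself show that income drives project placement.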

2.2 Part One

In [28]:
#let's slice our dataset for further analysis
table = merged_c.groupby(['region', 'country_name','gdp_per_capita_usd_2019', 'forest_area_2020_ha'])['count'].sum().sort_values(ascending = False).reset_index()
table
Out[28]:
region country_name gdp_per_capita_usd_2019 forest_area_2020_ha count
0 South America Brazil 8717.200 496619.60 77
1 South America Colombia 6432.400 59141.91 57
2 Asia Indonesia 4135.600 92133.20 54
3 Asia China 10261.700 219978.18 48
4 South America Peru 6977.700 72330.37 37
... ... ... ... ... ...
59 Africa Liberia 621.900 7617.44 1
60 Africa Guinea-Bissau 697.800 1980.01 1
61 Africa Egypt 3020.000 44.98 1
62 Africa Congo 2011.100 21946.00 1
63 South America Venezuela, Bolivarian Republic of 3410.845 46230.90 1

64 rows × 5 columns

In [29]:
fig = px.scatter(table, y = 'count', x = 'gdp_per_capita_usd_2019', color = 'region'
                 , trendline = 'ols', marginal_y="violin", marginal_x="box", template="simple_white"
                , hover_name = 'country_name', size = 'forest_area_2020_ha')
fig.show()
  • Reading the scatter plot of GDP per capita against the number of REDD+ projects:
    • South America has the highest median GDP per capita of all regions (6,432 USD), though the correlation appears slightly negative. South America also leads in number of projects: Brazil alone has 77 and Colombia 57. Excluding outliers such as Brazil could change the picture significantly.
    • Africa also appears to have a negative correlation, with a median GDP per capita of 897 USD. Kenya has 29 projects, followed by Congo (21) and Uganda (20).
    • Asia appears to have a positive correlation; the leading countries by number of projects are Indonesia (54) and China (48). Median GDP per capita is 2,780 USD.
    • Oceania has the lowest number of projects, led by Fiji (4); median GDP per capita is 2,951 USD.
  • Countries with larger forest area tend to have more REDD+ projects.
    • Marker size reflects the country's forest area. The larger circles are mostly concentrated in the upper part of the graph, which loosely supports the idea that more forest area goes with more REDD+ projects.
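The regional readings above come from eyeballing the trendlines. A minimal sketch of how they could be checked numerically, and how sensitive they are to a single outlier, assuming a country-level table shaped like `table` from In [28]. The rows in `toy` are made-up values for illustration only, and `region_corr` and `drop_top` are hypothetical helper names:

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `table` from In [28].
toy = pd.DataFrame({
    'region': ['X'] * 5 + ['Y'] * 4,
    'country_name': list('abcdefghi'),
    'gdp_per_capita_usd_2019': [1000, 2000, 3000, 4000, 8000, 500, 900, 1500, 2500],
    'count': [10, 8, 6, 4, 77, 1, 2, 3, 5],
})

def region_corr(df):
    """Spearman correlation (Pearson on ranks) of GDP per capita vs. project count, per region."""
    return pd.Series({
        region: g['gdp_per_capita_usd_2019'].rank().corr(g['count'].rank())
        for region, g in df.groupby('region')
    })

def drop_top(df, n=1):
    """Drop the n countries with the most projects within each region."""
    ranked = df.groupby('region')['count'].rank(ascending=False, method='first')
    return df[ranked > n]

with_outliers = region_corr(toy)
without_outliers = region_corr(drop_top(toy, n=1))
print(pd.DataFrame({'all': with_outliers, 'top_dropped': without_outliers}))
```

In this toy data, removing the single largest country in region X moves the rank correlation from zero to strongly negative, which is the kind of shift the Brazil remark above anticipates.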
In [30]:
df_plot = merged_c.drop(['project_id', 'project_name', 'secondary_name','start_year'
                             , 'end_year','project_description', 'objective_1', 'objective_2'
                             ,'objective_3', 'deforestation_drivers', 'project_type'
                             , 'id_country', 'participation_in_global_redd+_programs'
                             , 'forest_tenure_2015', 'comment', 'count', 'country_name'
                             , 'size_of_crediting_area_in_hectare'] 
                             , axis = 1)
df_plot.columns
Out[30]:
Index(['size_in_hectare', 'duration', 'size_of_pa_hectares',
       'estimated_proportion_of_project_located_in_pa', 'status', 'region',
       'hdi_2019', 'gdp_bln_usd_2019', 'gdp_per_capita_usd_2019',
       'population (2019)', 'forest_area_2020_ha', 'forest_loss_2020',
       'annual_deforestation_rate_2015_2020',
       'index_of_government_effectiveness_2018',
       'index_of_corruption_control_2018'],
      dtype='object')
In [31]:
plt.figure(figsize = (10,8))
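# Keep numeric columns only: 'status' and 'region' are text and cannot be correlated
heatmap = sns.heatmap(df_plot.select_dtypes(include='number').corr(), annot = True, vmin = -1, vmax = 1, cmap='coolwarm')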
heatmap.set_title('Correlation Heatmap', fontdict = {'fontsize':14}, pad = 12)
Out[31]:
Text(0.5, 1.0, 'Correlation Heatmap')

From my data sample of ongoing REDD+ projects, a few interesting high correlations were identified for further analysis:

  • Forest loss is positively correlated with GDP (.7) and population (.62). It is also positively correlated with government effectiveness (.47), at almost the same strength as with annual deforestation rate (.46). Forest loss is negatively correlated with forest size (-.41).
  • Annual deforestation rate shows somewhat similar trends to forest loss. It has moderate positive correlations with GDP (.51), population (.5), forest loss (.46), index of corruption control (.46) and GDP per capita (.44), and no notable negative correlation.

Intuitively, it makes sense that countries with high GDP tend to have more economic activity such as agriculture and manufacturing. And because GDP is more strongly correlated with population, the index of government effectiveness and corruption control, those indirect correlations are reflected in the heatmap.
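Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be listed directly from the correlation matrix. A minimal sketch, where `top_pairs` is a hypothetical helper and the `toy` values are made up. If `merged_c` holds one row per project (as its `project_id` column suggests), country-level indicators repeat once per project, so the optional `unit` argument keeps one row per country before correlating:

```python
import numpy as np
import pandas as pd

def top_pairs(df, k=5, unit=None):
    """Return the k variable pairs with the largest absolute correlation."""
    if unit is not None:
        # Keep one row per unit (e.g. per country) so repeated rows do not add weight
        df = df.drop_duplicates(subset=unit)
    corr = df.select_dtypes(include='number').corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(k)

# Made-up values for illustration only; country 'p' appears three times.
toy = pd.DataFrame({
    'country': ['p', 'p', 'p', 'q', 'r', 's'],
    'a': [1, 1, 1, 2, 3, 4],
    'b': [2, 2, 2, 4, 6, 8],
    'c': [1, 1, 1, 3, 2, 5],
})

top = top_pairs(toy, k=3, unit='country')
print(top)
```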

2.2 Part Two¶

In [32]:
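# Create a new column categorizing GDP per capita
def gdp_bins(gdp):
    # Each threshold belongs to the higher band, so no value falls through a gap
    # (previously 10000, 5000 and values such as 14999.5 ended up in 'Low')
    if gdp >= 15000:
        return 'Very High'
    if gdp >= 10000:
        return 'High'
    if gdp >= 5000:
        return 'Medium'
    return 'Low'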

merged_c['gdp_bins'] = merged_c['gdp_per_capita_usd_2019'].apply(gdp_bins)
print(merged_c.gdp_bins[:3])
0    Medium
1    Medium
2    Medium
Name: gdp_bins, dtype: object
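The same categorization can be expressed with `pd.cut`, which makes the band edges explicit. A minimal sketch, assuming half-open intervals in which each threshold belongs to the higher band (an assumption about the intended boundaries); the `gdp` values below are a made-up sample for illustration:

```python
import numpy as np
import pandas as pd

# Band edges in USD per capita; right=False makes each interval [lower, upper)
edges = [-np.inf, 5000, 10000, 15000, np.inf]
labels = ['Low', 'Medium', 'High', 'Very High']

# Made-up sample values, including the boundary cases
gdp = pd.Series([621.9, 5000.0, 8717.2, 10261.7, 14999.5, 15000.0])

binned = pd.cut(gdp, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

Unlike an `if`/`else` chain, `pd.cut` leaves missing values as `NaN` instead of assigning them to a band.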
In [33]:
fig_1 = px.histogram(merged_c, x = 'gdp_per_capita_usd_2019', nbins = 10
                   , title = 'Are countries with lower GDP per capita more likely to implement REDD+ projects?'
                   , color_discrete_sequence=['indianred'])
fig_1.show()

skewness_gdp = merged_c.gdp_per_capita_usd_2019.skew()

print('Skewness: ', skewness_gdp)
print('\n')

# Chi_square test to check 
Ho = "GDP per capita has no effect on numbers of REDD+ projects"   # Stating the Null Hypothesis
Ha = "GDP per capita has effect on numbers of REDD+ projects"   # Stating the Alternate Hypothesis

crosstab = pd.crosstab(merged_c['gdp_per_capita_usd_2019'], merged_c['gdp_bins'])
# Contingency table 

chi, p_value, dof, expected =  stats.chi2_contingency(crosstab)

if p_value < 0.01:  # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
print('\n')
print("p_value is: ", p_value)
Skewness:  0.6160996708978211


GDP per capita has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01


p_value is:  6.424738786483014e-276
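A caveat on the test above: `gdp_bins` is derived from `gdp_per_capita_usd_2019` itself, so in the crosstab every GDP value falls into exactly one bin. For a table of that shape the chi-square statistic works out to N × (number of non-empty bins - 1) whatever the data are, so the very small p-value reflects how the table was built rather than any link to the number of projects. A toy sketch with made-up values and hypothetical bin edges illustrates this:

```python
import pandas as pd
from scipy import stats

# Made-up values for illustration only
values = pd.Series([1, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7])

# Bins derived from the very same variable (hypothetical edges)
bins = pd.cut(values, bins=[0, 2, 4, 7], labels=['low', 'mid', 'high'])

# Each row (a unique value) has all of its observations in a single column
crosstab = pd.crosstab(values, bins)
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)

n, n_bins = len(values), crosstab.shape[1]
print('chi2:', chi2, '| N * (bins - 1):', n * (n_bins - 1))
print('p_value:', p_value)
```

The statistic matches N × (bins - 1) exactly, which is why a variable cross-tabulated against its own bins always looks "significant".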
In [34]:
merged_c.columns
Out[34]:
Index(['project_id', 'project_name', 'secondary_name', 'size_in_hectare',
       'size_of_crediting_area_in_hectare', 'start_year', 'end_year',
       'duration', 'project_description', 'objective_1', 'objective_2',
       'objective_3', 'deforestation_drivers', 'project_type', 'id_country',
       'size_of_pa_hectares', 'estimated_proportion_of_project_located_in_pa',
       'status', 'region', 'country_name', 'hdi_2019', 'gdp_bln_usd_2019',
       'gdp_per_capita_usd_2019', 'population (2019)', 'forest_area_2020_ha',
       'forest_loss_2020', 'annual_deforestation_rate_2015_2020',
       'index_of_government_effectiveness_2018',
       'index_of_corruption_control_2018',
       'participation_in_global_redd+_programs', 'forest_tenure_2015',
       'comment', 'count', 'gdp_bins'],
      dtype='object')

2.2 Part Three¶

In [35]:
# Create a new column categorizing forest area
def frst_bins(area):
    if 0 < area < 100000:
        return 'L1'
    if 100000 <= area < 200000:
        return 'L2'
    if 200000 <= area < 300000:
        return 'L3'
    if 300000 <= area < 400000:
        return 'L4'
    # 400000 and above (non-positive or missing values also land here)
    return 'L5'

merged_c['forest_bins'] = merged_c['forest_area_2020_ha'].apply(frst_bins)
print(merged_c.forest_bins[:3])
0    L1
1    L1
2    L1
Name: forest_bins, dtype: object
In [36]:
fig_2 = px.histogram(merged_c, x = 'forest_area_2020_ha', nbins = 10
                   , title = "Is there any correlation between the size of a country's forest area and the number of REDD+ projects implemented?"
                   , color_discrete_sequence=['green'])
fig_2.show()

skewness_frst = merged_c.forest_area_2020_ha.skew()

print('Skewness: ', skewness_frst)
print('\n')

# Chi_square test to check 
Ho = "Size of forest area has no effect on numbers of REDD+ projects"   # Stating the Null Hypothesis
Ha = "Size of forest area has effect on numbers of REDD+ projects"   # Stating the Alternate Hypothesis

crosstab = pd.crosstab(merged_c['forest_area_2020_ha'], merged_c['forest_bins'])
# Contingency table 

chi, p_value, dof, expected =  stats.chi2_contingency(crosstab)

if p_value < 0.01:  # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
print('\n')
print("p_value is: ", p_value)
Skewness:  1.7816248476554384


Size of forest area has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01


p_value is:  6.424738786484476e-276
In [37]:
fig_3 = px.histogram(merged_c, x = 'annual_deforestation_rate_2015_2020', nbins = 10
                   , title = "Does forest loss have an impact on the number of REDD+ projects?"
                   , color_discrete_sequence=['blue'])
fig_3.show()

skewness_frstloss = merged_c.annual_deforestation_rate_2015_2020.skew()

print('Skewness: ', skewness_frstloss)
print('\n')

# Chi_square test to check 
Ho = "Annual forest loss rate has no effect on numbers of REDD+ projects"   # Stating the Null Hypothesis
Ha = "Annual forest loss rate has effect on numbers of REDD+ projects"   # Stating the Alternate Hypothesis

crosstab = pd.crosstab(merged_c['annual_deforestation_rate_2015_2020'], merged_c['forest_bins'])
# Contingency table 

chi, p_value, dof, expected =  stats.chi2_contingency(crosstab)

if p_value < 0.01:  # Setting our significance level at 1%
    print(f'{Ha} as the p_value ({p_value.round(3)}) < 0.01')
else:
    print(f'{Ho} as the p_value ({p_value.round(3)}) > 0.01')
    
print('\n')
print("p_value is: ", p_value)
Skewness:  -0.7137900186619043


Annual forest loss rate has effect on numbers of REDD+ projects as the p_value (0.0) < 0.01


p_value is:  5.882485968225877e-279
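Since the hypotheses concern the number of projects per country, one option is to test them on a country-level table (one row per country, such as `table` from In [28]) with a rank correlation, which does not assume normality and suits the skewed distributions seen above. A minimal sketch using `scipy.stats.spearmanr`; the rows below are made-up values for illustration, and `rank_association` is a hypothetical helper name, not part of the original analysis:

```python
import pandas as pd
from scipy import stats

# Made-up country-level rows for illustration only (one row per country)
toy = pd.DataFrame({
    'forest_area_2020_ha': [10, 50, 100, 400, 900, 2000],
    'count': [1, 3, 2, 8, 15, 30],
})

def rank_association(df, x, y='count', alpha=0.01):
    """Spearman rank correlation between column x and the project count."""
    clean = df[[x, y]].dropna()
    rho, p_value = stats.spearmanr(clean[x], clean[y])
    # H0: no monotonic association between x and y
    return {'rho': rho, 'p_value': p_value, 'reject_h0': p_value < alpha}

result = rank_association(toy, 'forest_area_2020_ha')
print(result)
```

The same call could be repeated with `gdp_per_capita_usd_2019` or a deforestation column as `x`, provided that column is present in the country-level table.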

Conclusion¶

Through this data analysis exercise, I focused on REDD+ projects in relation to the economic parameters of the host country. To be continued...

Reference¶

[1] Griscom BW, Adams J, Ellis PW, Houghton RA, Lomax G, Miteva DA, Schlesinger WH, Shoch D, Siikamäki JV, Smith P et al.: Natural climate solutions. Proc Natl Acad Sci 2017, 114:11645-11650